Machine learning is the study of how computers can adaptively improve their performance with experience accumulated from observed data. Our two sister courses teach the most fundamental algorithmic, theoretical, and practical tools that any user of machine learning needs to know. The first course of the two focuses more on mathematical tools, and the other course focuses more on algorithmic tools.
Machine Learning Foundations
Topic 1: When can machines learn?
Lecture 1: The Learning Problem
A takes D and H to get g
Course Introduction
foundation oriented and story-like
What is Machine Learning
use data to approximate target
Applications of Machine Learning
almost everywhere
Components of Machine Learning
A takes D and H to get g
Machine Learning and Other Fields
related to Data Mining, Artificial Intelligence, and Statistics
Lecture 2: Learning to Answer Yes/No
PLA A takes linear separable D and perceptrons H to get hypothesis g
Perceptron Hypothesis Set
hyperplanes/linear classifiers in Rd
Perceptron Learning Algorithm (PLA)
correct mistakes and improve iteratively
Guarantee of PLA
no mistake eventually if linear separable
Non-Separable Data
hold somewhat ‘best’ weights in pocket
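A minimal numpy sketch of the PLA/pocket story above (not the course's official code): the function name `pla_pocket`, the random mistake pick, and the `max_iter` cap are illustrative assumptions; `X` is assumed to carry a leading 1 for the bias and `y` to be in {-1, +1}.

```python
import numpy as np

def pla_pocket(X, y, max_iter=1000):
    """PLA with pocket: X has a leading 1-column (bias), y in {-1, +1}."""
    w = np.zeros(X.shape[1])            # current weights
    pocket_w = w.copy()                 # somewhat 'best' weights kept in the pocket
    pocket_err = np.mean(np.sign(X @ w) != y)
    for _ in range(max_iter):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:          # no mistake: done (linearly separable case)
            return w
        n = np.random.choice(mistakes)
        w = w + y[n] * X[n]             # correct one mistake
        err = np.mean(np.sign(X @ w) != y)
        if err < pocket_err:            # keep the better weights in the pocket
            pocket_w, pocket_err = w.copy(), err
    return pocket_w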
Lecture 3: Types of Learning
focus: binary classification or regression from a batch of supervised data with concrete features
Learning with Different Output Space Y
[classification], [regression], structured
Learning with Different Data Label yn
[supervised], un/semi-supervised, reinforcement
Learning with Different Protocol f ⇒ (xn, yn)
[batch], online, active
Learning with Different Input Space X
[concrete], raw, abstract
Lecture 4: feasibility of learning
learning is PAC-possible if enough statistical data and finite |H|
Learning is Impossible?
absolutely no free lunch outside D
Probability to the Rescue
probably approximately correct outside D
Connection to Learning
verification possible if Ein(h) small for fixed h
Connection to Real Learning
learning possible if |H| finite and Ein(g) small
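Under the same assumptions as the lecture (finite hypothesis set of size M = |H|, N i.i.d. examples), the "probably approximately correct" argument is the Hoeffding inequality combined with a union bound over H:

```latex
\mathbb{P}\!\left[\,\exists\, h \in \mathcal{H}:\ |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\right]
\;\le\; 2M \exp\!\left(-2\epsilon^{2} N\right)
```

So with finite |H| and large enough N, Ein(g) ≈ Eout(g) for whatever g the algorithm picks, and a small Ein(g) then implies learning.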
Topic 2: Why Can Machines Learn?
Lecture 5: training versus testing
effective price of choice in training: (wishfully) growth function mH(N) with a break point
Recap and Preview
two questions: Eout(g) ≈ Ein(g), and Ein(g) ≈ 0
Effective Number of Lines
at most 14 through the eye of 4 inputs
Effective Number of Hypotheses
at most mH(N) through the eye of N inputs
Break Point
when mH(N) becomes ‘non-exponential’
Lecture 6: theory of generalization
Eout ≈ Ein possible if mH(N) breaks somewhere and N large enough
Restriction of Break Point
break point ‘breaks’ consequent points
Bounding Function: Basic Cases
B(N,k) bounds mH(N) with break point k
Bounding Function: Inductive Cases
B(N,k) is poly(N)
A Pictorial Proof
mH(N) can replace M with a few changes
Lecture 7: the VC dimension
learning happens if finite dVC, large N, and low Ein
Definition of VC Dimension
maximum non-break point
VC Dimension of Perceptrons
dVC(H) = d + 1
Physical Intuition of VC Dimension
dVC ≈ #free parameters
Interpreting VC Dimension
loosely: model complexity & sample complexity
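The whole story of Lectures 5-7 compresses into the VC generalization bound (standard form used in the course): with probability at least 1 - δ,

```latex
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}},
\qquad
m_{\mathcal{H}}(N) \;\le\; \sum_{i=0}^{d_{\text{VC}}} \binom{N}{i} = O\!\left(N^{d_{\text{VC}}}\right).
```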
Lecture 8: noise and error
learning can happen with target distribution P(y|x) and low Ein w.r.t. err
Noise and Probabilistic Target
can replace f(x) by P(y|x)
Error Measure
affect ‘ideal’ target
Algorithmic Error Measure
user-dependent =⇒ plausible or friendly
Weighted Classification
easily done by virtual ‘example copying’
Topic 3: How Can Machines Learn?
Lecture 9: linear regression
analytic solution wLIN = X†y with linear regression hypotheses and squared error
Linear Regression Problem
use hyperplanes to approximate real values
Linear Regression Algorithm
analytic solution with pseudo-inverse
Generalization Issue
Eout − Ein ≈ 2(d+1)/N on average
Linear Regression for Binary Classification
0/1 error ≤ squared error
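A sketch of the analytic solution wLIN = X†y with numpy; `X` is assumed to already include the constant feature x0 = 1, and the function names are illustrative.

```python
import numpy as np

def linear_regression(X, y):
    """Analytic solution w_LIN = pseudo-inverse(X) @ y for squared error."""
    return np.linalg.pinv(X) @ y

def squared_error(w, X, y):
    """In-sample squared error E_in(w) of the returned hypothesis."""
    return np.mean((X @ w - y) ** 2)
```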
Lecture 10: logistic regression
gradient descent on cross-entropy error to get good logistic hypothesis
Logistic Regression Problem
P(+1|x) as target and θ(wT x) as hypotheses
Logistic Regression Error
cross-entropy (negative log likelihood)
Gradient of Logistic Regression Error
θ-weighted sum of data vectors
Gradient Descent
roll downhill by −∇Ein(w)
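A sketch of fixed-learning-rate gradient descent on the cross-entropy error; `eta` and `T` are assumed hyper-parameters, and `y` is in {-1, +1}.

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, eta=0.1, T=1000):
    """Gradient descent on cross-entropy (negative log likelihood) error."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient of E_in: theta(-y_n w^T x_n)-weighted sum of (-y_n x_n)
        grad = np.mean(logistic(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
        w = w - eta * grad              # roll downhill by -grad
    return w
```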
Lecture 11: linear models for classification
binary classification via (logistic) regression; multiclass via OVA/OVO decomposition
Linear Models for Binary Classification
three models useful in different ways
Stochastic Gradient Descent
follow negative stochastic gradient
Multiclass via Logistic Regression
predict with maximum estimated P(k|x)
Multiclass via Binary Classification
predict the tournament champion
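A sketch of the stochastic gradient descent item above for logistic regression, following the negative stochastic gradient computed on one randomly picked example per step; `eta`, `T`, and `seed` are assumed hyper-parameters.

```python
import numpy as np

def logistic_regression_sgd(X, y, eta=0.1, T=10000, seed=0):
    """SGD on cross-entropy error with one example per update; y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                               # pick one example
        s = -y[n] * (X[n] @ w)
        grad_n = (1.0 / (1.0 + np.exp(-s))) * (-y[n] * X[n])   # stochastic gradient
        w = w - eta * grad_n                                   # follow its negative
    return w
```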
Lecture 12: nonlinear transformation
nonlinear X via nonlinear feature transform Φ plus linear X with price of model complexity
Quadratic Hypotheses
linear hypotheses on quadratic-transformed data
Nonlinear Transform
happy linear modeling after Z = Φ(X )
Price of Nonlinear Transform
computation/storage/[model complexity]
Structured Hypothesis Sets
linear/simpler model first
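A sketch of the quadratic feature transform Φ2 behind "linear hypotheses on quadratic-transformed data"; note the price: the transformed dimension d̃ grows as O(d²).

```python
import numpy as np

def quadratic_transform(X):
    """Phi_2: prepend the constant 1, keep x_1..x_d, and add all x_i * x_j (i <= j).
    A linear model trained on Z = Phi_2(X) is a quadratic model on X."""
    n, d = X.shape
    quad = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([np.ones(n), X] + quad)
```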
Topic 4: How Can Machines Learn Better?
Lecture 13: hazard of overfitting
overfitting happens with excessive power, stochastic/deterministic noise, and limited data
What is Overfitting?
lower Ein but higher Eout
The Role of Noise and Data Size
overfitting ‘easily’ happens!
Deterministic Noise
what H cannot capture acts like noise
Dealing with Overfitting
data cleaning/pruning/hinting, and more
Lecture 14: regularization
minimizes augmented error, where the added regularizer effectively limits model complexity
Regularized Hypothesis Set
original H + constraint
Weight Decay Regularization
add (λ/N) wTw to Eaug
Regularization and VC Theory
regularization decreases dEFF
General Regularizers
target-dependent, [plausible], or [friendly]
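Weight decay combined with linear regression has a closed form (ridge regression); a sketch assuming `Z` is the (possibly transformed) data matrix:

```python
import numpy as np

def ridge_regression(Z, y, lam=1.0):
    """w_REG = argmin (1/N)||Zw - y||^2 + (lambda/N) w^T w
             = (Z^T Z + lambda I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```

Setting `lam=0` recovers plain linear regression; a larger `lam` acts as a stronger brake on the effective model complexity.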
Lecture 15: validation
(crossly) reserve validation data to simulate testing procedure for model selection
Model Selection Problem
dangerous by Ein and dishonest by Etest
Validation
select with Eval(Am(Dtrain)) while returning Am∗(D)
Leave-One-Out Cross Validation
huge computation for almost unbiased estimate
V -Fold Cross Validation
reasonable computation and performance
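A sketch of V-fold cross-validation; `train` and `error` are assumed placeholders for a learning algorithm Am and its error measure (hypothetical interfaces for illustration).

```python
import numpy as np

def v_fold_cv_error(X, y, train, error, V=10):
    """Estimate E_cv by averaging validation errors over V folds."""
    N = len(y)
    folds = np.array_split(np.random.permutation(N), V)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        g_minus = train(X[train_idx], y[train_idx])      # train on D \ D_val
        errs.append(error(g_minus, X[val_idx], y[val_idx]))
    return np.mean(errs)

# model selection: pick the m* with the smallest E_cv,
# then re-run A_{m*} on the full data D before returning its output.
```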
Lecture 16: three learning principles
Occam’s Razor
simple, simple, simple!
Sampling Bias
match test scenario as much as possible
Data Snooping
any use of data is ‘contamination’
Power of Three
relatives, bounds, models, tools, principles
Machine Learning Techniques
Topic 1: Embedding Numerous Features: Kernel Models
Lecture 1: Linear Support Vector Machine
linear SVM: more robust and solvable with quadratic programming
Course Introduction
from foundations to techniques
Large-Margin Separating Hyperplane
intuitively more robust against noise
Standard Large-Margin Problem
minimize ‘length of w’ at special separating scale
Support Vector Machine
‘easy’ via quadratic programming
Reasons behind Large-Margin Hyperplane
fewer dichotomies and better generalization
Lecture 2: Dual Support Vector Machine
dual SVM: another QP with valuable geometric messages and almost no dependence on d̃
Motivation of Dual SVM
want to remove dependence on d̃
Lagrange Dual SVM
KKT conditions link primal/dual
Solving Dual SVM
another QP, better solved with special solver
Messages behind Dual SVM
SVs represent fattest hyperplane
Lecture 3: Kernel Support Vector Machine
kernel as a shortcut to (transform + inner product) to remove dependence on d̃: allowing a spectrum from simple (linear) models to infinite-dimensional (Gaussian) ones with margin control
Kernel Trick
kernel as shortcut of transform + inner product
Polynomial Kernel
embeds specially-scaled polynomial transform
Gaussian Kernel
embeds infinite dimensional transform
Comparison of Kernels
linear for efficiency or Gaussian for power
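A sketch of the two kernels compared above; computing K(x, x') directly is the shortcut that avoids the explicit d̃-dimensional transform. The parameter names (`zeta`, `gamma`, `Q`) follow the lecture's notation.

```python
import numpy as np

def polynomial_kernel(X1, X2, zeta=1.0, gamma=1.0, Q=2):
    """K(x, x') = (zeta + gamma x^T x')^Q -- specially-scaled polynomial transform."""
    return (zeta + gamma * (X1 @ X2.T)) ** Q

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma ||x - x'||^2) -- embeds an infinite-dimensional transform."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)
```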
Lecture 4: Soft-Margin Support Vector Machine
allow some margin violations ξn while penalizing them by C; equivalent to upper-bounding αn by C
Motivation and Primal Problem
add margin violations ξn
Dual Problem
upper-bound αn by C
Messages behind Soft-Margin SVM
bounded/free SVs for data analysis
Model Selection
cross-validation, or approximately nSV
Lecture 5: Kernel Logistic Regression
two-level learning for SVM-like sparse model for soft classification, or using representer theorem with regularized logistic error for dense model
Soft-Margin SVM as Regularized Model
L2-regularization with hinge error measure
SVM versus Logistic Regression
≈ L2-regularized logistic regression
SVM for Soft Binary Classification
common approach: two-level learning
Kernel Logistic Regression
representer theorem on L2-regularized LogReg
Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + representer theorem; support vector regression (sparse) via regularized tube error + Lagrange dual
Kernel Ridge Regression
representer theorem on ridge regression
Support Vector Regression Primal
minimize regularized tube errors
Support Vector Regression Dual
a QP similar to SVM dual
Summary of Kernel Models
with great power comes great responsibility
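For the dense model of Lecture 6, kernel ridge regression has a closed form via the representer theorem; a sketch where `K` is the kernel matrix on the training data (built, e.g., with the `gaussian_kernel` sketch above).

```python
import numpy as np

def kernel_ridge_regression(K, y, lam=1.0):
    """beta = (lambda I + K)^{-1} y; the hypothesis is g(x) = sum_n beta_n K(x_n, x)."""
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)
```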
Topic 2: Combining Predictive Features: Aggregation Models
Lecture 7: Blending and Bagging
blending known diverse hypotheses uniformly, linearly, or even non-linearly; obtaining diverse hypotheses from bootstrapped data
Motivation of Aggregation
aggregated G strong and/or moderate
Uniform Blending
diverse hypotheses, ‘one vote, one value’
Linear and Any Blending
two-level learning with hypotheses as transform
Bagging (Bootstrap Aggregation)
bootstrapping for diverse hypotheses
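A sketch of bagging with uniform voting; `base_train` is an assumed placeholder that returns a callable hypothesis g with g(X) in {-1, +1}, and T, the number of bootstrap rounds, is illustrative.

```python
import numpy as np

def bagging(X, y, base_train, T=25, seed=0):
    """Bootstrap T datasets of size N and train one diverse hypothesis on each."""
    rng = np.random.default_rng(seed)
    N = len(y)
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(N, size=N)          # sample N examples with replacement
        hypotheses.append(base_train(X[idx], y[idx]))
    return hypotheses

def uniform_vote(hypotheses, X):
    """G(x) = sign of the uniform vote of all g_t(x): 'one vote, one value'."""
    votes = np.sum([g(X) for g in hypotheses], axis=0)
    return np.sign(votes)
```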
Lecture 8: Adaptive Boosting
optimal re-weighting for diverse hypotheses and adaptive linear aggregation to boost ‘weak’ algorithms
Motivation of Boosting
aggregate weak hypotheses for strength
Diversity by Re-weighting
scale up incorrect, scale down correct
Adaptive Boosting Algorithm
two heads are better than one, theoretically
Adaptive Boosting in Action
AdaBoost-Stump useful and efficient
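A sketch of AdaBoost-Stump: decision stumps as the 'weak' base learner, re-weighting by the scaling factor sqrt((1 - εt)/εt), and linear aggregation with αt = ln(scaling factor), following the lecture; the function names and the exhaustive stump search are illustrative.

```python
import numpy as np

def stump_predict(X, i, theta, s):
    """Decision stump h(x) = s * sign(x_i - theta), treating the boundary as +1."""
    return s * np.where(X[:, i] >= theta, 1.0, -1.0)

def best_stump(X, y, u):
    """Pick (feature, threshold, direction) minimizing the u-weighted 0/1 error."""
    best = (None, None, None, np.inf)
    for i in range(X.shape[1]):
        for theta in np.concatenate(([-np.inf], np.unique(X[:, i]))):
            for s in (+1.0, -1.0):
                err = np.sum(u * (stump_predict(X, i, theta, s) != y))
                if err < best[3]:
                    best = (i, theta, s, err)
    return best

def adaboost_stump(X, y, T=30):
    u = np.full(len(y), 1.0 / len(y))               # example weights
    stumps, alphas = [], []
    for _ in range(T):
        i, theta, s, _ = best_stump(X, y, u)
        pred = stump_predict(X, i, theta, s)
        eps = np.sum(u * (pred != y)) / np.sum(u)   # weighted error rate
        scale = np.sqrt((1 - eps) / max(eps, 1e-12))  # guard against eps = 0
        u = np.where(pred != y, u * scale, u / scale)  # scale up incorrect, down correct
        stumps.append((i, theta, s))
        alphas.append(np.log(scale))                # larger vote for stronger stumps
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * stump_predict(X, i, th, s) for (i, th, s), a in zip(stumps, alphas))
    return np.sign(score)
```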
Lecture 9: Decision Tree
recursive branching (purification) for conditional aggregation of constant hypotheses
Decision Tree Hypothesis
express path-conditional aggregation
Decision Tree Algorithm
recursive branching until termination to base
Decision Tree Heuristics in C&RT
pruning, categorical branching, surrogate
Decision Tree in Action
explainable and efficient
Lecture 10: Random Forest
bagging of randomized C&RT trees with automatic validation and feature selection
Random Forest Algorithm
bag of trees on randomly projected subspaces
Out-Of-Bag Estimate
self-validation with OOB examples
Feature Selection
permutation test for feature importance
Random Forest in Action
‘smooth’ boundary with many trees
Lecture 11: Gradient Boosted Decision Tree
aggregating trees from functional gradient and steepest descent subject to any error measure
Adaptive Boosted Decision Tree
sampling and pruning for ‘weak’ trees
Optimization View of AdaBoost
functional gradient descent on exponential error
Gradient Boosting
iterative steepest residual fitting
Summary of Aggregation Models
some cure underfitting; some cure overfitting
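A sketch of gradient boosting for regression as "iterative steepest residual fitting"; `base_train` is an assumed placeholder for a weak regressor (e.g. a small C&RT tree), and the fixed shrinkage `eta` stands in for the per-round steepest-descent step size from the lecture.

```python
import numpy as np

def gradient_boost_regression(X, y, base_train, T=100, eta=0.1):
    """Each round fits a 'weak' regressor g_t to the current residuals y - F(x)
    (the negative functional gradient of squared error), then adds eta * g_t to F."""
    preds = np.zeros(len(y))
    ensemble = []
    for _ in range(T):
        residual = y - preds                # residuals = negative functional gradient
        g_t = base_train(X, residual)       # g_t is a callable returned by the base learner
        preds = preds + eta * g_t(X)        # steepest-descent-style update of F
        ensemble.append(g_t)
    return lambda X_new: eta * sum(g(X_new) for g in ensemble)
```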
Topic 3: Distilling Implicit Features: Extraction Models
Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD
Motivation
multi-layer for power with biological inspirations
Neural Network Hypothesis
layered pattern extraction until linear hypothesis
Neural Network Learning
backprop to compute gradient efficiently
Optimization and Regularization
tricks on initialization, regularizer, early stopping
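A sketch of one backprop + gradient descent step for a single-hidden-layer tanh network with squared error; the shapes, weight matrices `W1`/`W2`, and `eta` are illustrative assumptions, not the course's code.

```python
import numpy as np

def nnet_gd_step(X, y, W1, W2, eta=0.1):
    """One GD step for a 1-hidden-layer tanh network with squared error.
    X: (N, d+1) with leading 1s; W1: (d+1, M); W2: (M+1, 1); y: (N, 1)."""
    # forward pass: layered pattern extraction ending in a linear output hypothesis
    S1 = X @ W1
    A1 = np.hstack([np.ones((len(X), 1)), np.tanh(S1)])   # add the bias unit
    out = A1 @ W2
    # backward pass: propagate 'sensitivities' to get the gradient efficiently
    delta2 = 2 * (out - y) / len(X)                        # d(mean squared error)/d(out)
    grad_W2 = A1.T @ delta2
    delta1 = (delta2 @ W2[1:].T) * (1 - np.tanh(S1) ** 2)  # skip the bias weight
    grad_W1 = X.T @ delta1
    return W1 - eta * grad_W1, W2 - eta * grad_W2
```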
Lecture 13: Deep Learning
pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers
Deep Neural Network
difficult hierarchical feature extraction problem
Autoencoder
unsupervised NNet learning of representation
Denoising Autoencoder
using noise as hints for regularization
Principal Component Analysis
linear autoencoder variant for data processing
Lecture 14: Radial Basis Function Network
linear aggregation of distance-based similarities using k-Means clustering for prototype finding
RBF Network Hypothesis
prototypes instead of neurons as transform
RBF Network Learning
linear aggregation of prototype ‘hypotheses’
k-Means Algorithm
clustering with alternating optimization
k-Means and RBF Network in Action
proper choice of # prototypes important
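A sketch of k-Means by alternating optimization, as used for prototype (center) finding in the RBF Network; `k`, `iters`, and `seed` are assumed parameters.

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Alternate: fix centers -> assign each point to its nearest center;
    fix assignments -> recompute each center as the cluster mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init with k prototypes
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)                    # optimal assignment step
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, assign
```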
Lecture 15: Matrix Factorization
linear models of movies on extracted user features (or vice versa) jointly optimized with stochastic gradient descent
Linear Network Hypothesis
feature extraction from binary vector encoding
Basic Matrix Factorization
alternating least squares between user/movie
Stochastic Gradient Descent
efficient and easily modified for practical use
Summary of Extraction Models
powerful thus need careful use
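A sketch of matrix factorization by SGD; `ratings` is assumed to be an iterable of (user index, movie index, rating) triples, and the per-example update follows the residual-times-feature gradient of squared error.

```python
import numpy as np

def matrix_factorization_sgd(ratings, n_users, n_movies, d=10,
                             eta=0.01, T=20, seed=0):
    """Learn user features V and movie features W so that rating ~ v_u^T w_m."""
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((n_users, d))
    W = 0.1 * rng.standard_normal((n_movies, d))
    for _ in range(T):                               # T passes over the ratings
        for u, m, r in rng.permutation(np.asarray(ratings, dtype=float)):
            u, m = int(u), int(m)
            residual = r - V[u] @ W[m]               # per-example residual
            grad_v = residual * W[m]
            grad_w = residual * V[u]
            V[u] += eta * grad_v                     # SGD step on user features
            W[m] += eta * grad_w                     # SGD step on movie features
    return V, W
```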
Lecture 16: Finale
Feature Exploitation Techniques
kernel, aggregation, extraction, low-dimensional
Error Optimization Techniques
gradient, equivalence, stages
Overfitting Elimination Techniques
(lots of) regularization, validation
Machine Learning in Practice
welcome to the jungle
Notes
Terms
- regularization; ridge regression
- pseudo-inverse matrix
- discrete function
- Lagrange dual method
- Lagrange minimization
- Lagrange multiplier method
- space transformation (feature transform)
- primal SVM
- dual SVM
- Decision Stump
L1-5 25
L2-19
L4-18 19
L5-4 5 9 15 17
L7 From the two examples above, we see two advantages of aggregation: feature transform and regularization. As introduced in the Machine Learning Foundations course, feature transform and regularization pull in opposite directions, like the gas pedal and the brake: if aggressive feature transform is performed, regularization usually works poorly, and vice versa. A single model therefore typically leans toward either feature transform or regularization and must trade one off against the other. Aggregation, however, can combine the strengths of both, like controlling the gas and the brake well at the same time, and thus yields a good predictive model.
L07-20 bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A
L07-21 Note that bagging performs well only when the base algorithm is fairly sensitive to the distribution of the data samples.
L08-20
L09-10
L10-2 Bagging can reduce variance, while a Decision Tree can increase variance.
L11-10
L12-6 In the perceptron model, the input of each node corresponds to a neuron's dendrites, and the output of each node corresponds to the neuron's axon.
L12-13 16GD 21
L14-5 9 10